Self-consistent redshift estimation using correlation functions without a spectroscopic reference sample
We present a new method to estimate redshift distributions and galaxy-dark
matter bias parameters using correlation functions in a fully data driven and
self-consistent manner. Unlike other machine learning, template, or correlation
redshift methods, this approach does not require a reference sample with known
redshifts. By measuring the projected cross- and auto- correlations of
different galaxy sub-samples, e.g., as chosen by simple cells in
color-magnitude space, we are able to estimate the galaxy-dark matter bias
model parameters, and the shape of the redshift distributions of each
sub-sample. This method fully marginalises over a flexible parameterisation of
the redshift distribution and galaxy-dark matter bias parameters of sub-samples
of galaxies, and thus provides a general Bayesian framework to incorporate
redshift uncertainty into the cosmological analysis in a data-driven,
consistent, and reproducible manner. This result is improved by an order of
magnitude by including cross-correlations with the CMB and with galaxy-galaxy
lensing.
We showcase how this method could be applied to real galaxies. By using
idealised data vectors, in which all galaxy-dark matter model parameters and
redshift distributions are known, this method is demonstrated to recover
unbiased estimates on important quantities, such as the offset
between the mean of the true and estimated redshift distribution and the 68%,
95%, and 99.5% widths of the redshift distribution to an accuracy required
by current and future surveys.
Comment: 20 pages, 11 figures, text revised for clarification, version accepted
by journal, conclusions unchanged
Does stellar mass assembly history vary with environment?
Using the publicly available VESPA database of SDSS Data Release 7 spectra,
we calculate the stellar Mass Weighted Age (hereafter MWA) as a function of
local galaxy density and dark matter halo mass. We compare our results with
semi-analytic models from the public Millennium Simulation. We find that the
stellar MWA has a large scatter which is inherent in the data and consistent
with that seen in semi-analytic models. The stellar MWA is consistent with
being independent (to first order) of local galaxy density, which is also
seen in semi-analytic models.
As a function of increasing dark matter halo mass (using the SDSS New York
Value Added Group catalogues), we find that the average stellar MWA for member
galaxies increases, which is again found in semi-analytic models. Furthermore,
we use a public dark matter Mass Accretion History (MAH) code, calibrated on
simulations, to calculate the dark matter Mass Weighted Age as a function of
dark matter halo mass. In agreement with earlier analyses, we find that the
stellar MWA and the dark matter MWA are anti-correlated for high-mass halos,
i.e., dark matter accretion does not seem to be the primary factor in
determining when stellar mass was assembled. This effect can be described by
down-sizing.
Comment: 11 pages, 3 figures, submitted to MNRAS
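As a minimal sketch of the quantity discussed above, a mass-weighted age can be computed as the mass-weighted mean of the ages of a galaxy's stellar populations. The abstract does not spell out its definition, so the formula sum(m_i * t_i) / sum(m_i), the function name, and the toy age bins below are assumptions for illustration and are not taken from the VESPA database.

```python
def mass_weighted_age(masses, ages):
    """Return sum(m_i * t_i) / sum(m_i) for matching lists of masses and ages.

    Assumed definition: each entry is one age bin, with the stellar mass
    formed in that bin (any consistent unit) and a representative age.
    """
    if len(masses) != len(ages):
        raise ValueError("masses and ages must have the same length")
    total_mass = sum(masses)
    if total_mass <= 0:
        raise ValueError("total mass must be positive")
    return sum(m * t for m, t in zip(masses, ages)) / total_mass


# Hypothetical galaxy with three age bins (ages in Gyr, masses in arbitrary units).
example_ages = [1.0, 5.0, 10.0]
example_masses = [1.0, 1.0, 2.0]
example_mwa = mass_weighted_age(example_masses, example_ages)
```

Because the oldest bin carries half of the mass in this toy case, the result is pulled towards 10 Gyr rather than sitting at the unweighted mean of the three ages.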
Learning from the machine: interpreting machine learning algorithms for point- and extended- source classification
We investigate star-galaxy classification for astronomical surveys in the
context of four methods enabling the interpretation of black-box machine
learning systems. The first is outputting and exploring the decision boundaries
as given by decision tree based methods, which enables the visualization of the
classification categories. Secondly, we investigate how the Mutual Information
based Transductive Feature Selection (MINT) algorithm can be used to perform
feature pre-selection. If one would like to provide only a small number of
input features to a machine learning classification algorithm, feature
pre-selection provides a method to determine which of the many possible input
properties should be selected. Third is the use of the tree-interpreter package
to enable popular decision tree based ensemble methods to be opened,
visualized, and understood. This is done by additional analysis of the tree
based model, determining not only which features are important to the model,
but how important a feature is for a particular classification given its value.
Lastly, we use decision boundaries from the model to revise an already existing
method of classification, essentially asking the tree based method where
decision boundaries are best placed and defining a new classification method.
We showcase these techniques by applying them to the problem of star-galaxy
separation using data from the Sloan Digital Sky Survey (hereafter SDSS). We
use the output of MINT and the ensemble methods to demonstrate how more complex
decision boundaries improve star-galaxy classification accuracy over the
standard SDSS frames approach (reducing misclassifications by up to
). We then show how tree-interpreter can be used to explore how
relevant each photometric feature is when making a classification on an
object-by-object basis.
Comment: 12 pages, 8 figures, 8 tables
Feature importance for machine learning redshifts applied to SDSS galaxies
We present an analysis of feature importance selection applied to photometric
redshift estimation using the machine learning architecture Decision Trees with
the ensemble learning routine AdaBoost (hereafter RDF). We select a list of 85
easily measured (or derived) photometric quantities (or `features') and
spectroscopic redshifts for almost two million galaxies from the Sloan Digital
Sky Survey Data Release 10. After identifying which features have the most
predictive power, we use standard artificial Neural Networks (aNN) to show that
the addition of these features, in combination with the standard magnitudes and
colours, improves the machine learning redshift estimate by 18% and decreases
the catastrophic outlier rate by 32%. We further compare the redshift estimate
using RDF with those from two different aNNs, and with photometric redshifts
available from the SDSS. We find that the RDF requires orders of magnitude less
computation time than the aNNs to obtain a machine learning redshift while
reducing both the catastrophic outlier rate by up to 43%, and the redshift
error by up to 25%. When compared to the SDSS photometric redshifts, the RDF
machine learning redshifts both decrease the standard deviation of residuals
scaled by 1/(1+z) by 36% from 0.066 to 0.041, and decrease the fraction of
catastrophic outliers by 57% from 2.32% to 0.99%.
Comment: 10 pages, 4 figures, updated to match version accepted in MNRAS
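The two statistics quoted above can be sketched as follows. This is an illustrative computation only: the abstract does not state which redshift enters the 1/(1+z) scaling or how a catastrophic outlier is defined, so the use of the spectroscopic redshift and the threshold of 0.15 on the scaled residual are assumptions, as are all function and variable names.

```python
import statistics


def scaled_residuals(z_phot, z_spec):
    """Residuals (z_phot - z_spec) scaled by 1 / (1 + z_spec) (assumed convention)."""
    return [(zp - zs) / (1.0 + zs) for zp, zs in zip(z_phot, z_spec)]


def redshift_metrics(z_phot, z_spec, outlier_threshold=0.15):
    """Return (standard deviation of scaled residuals, outlier fraction).

    outlier_threshold is an assumed cut on |scaled residual|; the source
    does not give its own definition of a catastrophic outlier.
    """
    residuals = scaled_residuals(z_phot, z_spec)
    sigma = statistics.pstdev(residuals)
    n_outliers = sum(1 for r in residuals if abs(r) > outlier_threshold)
    return sigma, n_outliers / len(residuals)


# Toy data: three good estimates and one badly wrong estimate.
toy_z_spec = [0.1, 0.2, 0.3, 0.4]
toy_z_phot = [0.1, 0.2, 0.3, 1.0]
toy_sigma, toy_outlier_fraction = redshift_metrics(toy_z_phot, toy_z_spec)
```

In the toy data only the last galaxy exceeds the assumed threshold, so one object in four is counted as a catastrophic outlier.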
Tuning target selection algorithms to improve galaxy redshift estimates
We showcase machine learning (ML) inspired target selection algorithms to
determine which of all potential targets should be selected first for
spectroscopic follow up. Efficient target selection can improve the ML redshift
uncertainties as calculated on an independent sample, while requiring fewer
targets to be observed. We compare the ML targeting algorithms with the Sloan
Digital Sky Survey (SDSS) target order, and with a random targeting algorithm.
The ML inspired algorithms are constructed iteratively by estimating which of
the remaining target galaxies will be most difficult for the machine learning
methods to accurately estimate redshifts using the previously observed data.
This is performed by predicting the expected redshift error and redshift offset
(or bias) of all of the remaining target galaxies. We find that the predicted
values of bias and error are accurate to better than 10-30% of the true values,
even with only limited training sample sizes. We construct a hypothetical
follow-up survey and find that some of the ML targeting algorithms are able to
obtain the same redshift predictive power with 2-3 times less observing time,
as compared to that of the SDSS, or random, target selection algorithms. The
reduction in the required follow up resources could allow for a change to the
follow-up strategy, for example by obtaining deeper spectroscopy, which could
improve ML redshift estimates for deeper test data.
Comment: 16 pages, 9 figures, updated to match MNRAS accepted version. Minor
text changes, results unchanged
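The iterative construction described above can be sketched as a greedy loop: at each step, score every remaining target by how hard it is expected to be given what has already been observed, then observe the hardest one next. The sketch below is not the authors' implementation. The difficulty score used here (distance to the nearest observed object in a one-dimensional toy feature) is a stand-in for the predicted redshift error and bias from a trained ML model, and all names are hypothetical.

```python
def order_targets(candidates, seed_observed, difficulty):
    """Greedily order candidates, always picking the currently hardest one.

    difficulty(candidate, observed) is a caller-supplied score; larger means
    the candidate is expected to be harder given the observed set so far.
    """
    observed = list(seed_observed)
    remaining = list(candidates)
    order = []
    while remaining:
        hardest = max(remaining, key=lambda c: difficulty(c, observed))
        remaining.remove(hardest)
        observed.append(hardest)  # the new observation informs later scores
        order.append(hardest)
    return order


def nearest_distance(candidate, observed):
    """Stand-in difficulty: distance to the closest already-observed object."""
    return min(abs(candidate - o) for o in observed)


# Toy one-dimensional feature values for four candidate targets.
toy_order = order_targets(
    [0.1, 0.5, 0.9, 0.55], seed_observed=[0.0], difficulty=nearest_distance
)
```

Scores are recomputed after every pick, which is why 0.55 drops to the end of the order once its neighbour 0.5 has been observed.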
Combining clustering and abundances of galaxy clusters to test cosmology and primordial non-Gaussianity
We present the clustering of galaxy clusters as a useful addition to the
common set of cosmological observables. The clustering of clusters probes the
large-scale structure of the Universe, extending galaxy clustering analysis to
the high-peak, high-bias regime. Clustering of galaxy clusters complements the
traditional cluster number counts and observable-mass relation analyses,
significantly improving their constraining power by breaking existing
calibration degeneracies. We use the maxBCG galaxy clusters catalogue to
constrain cosmological parameters and cross-calibrate the mass-observable
relation, using cluster abundances in richness bins and weak-lensing mass
estimates. We then add the redshift-space power spectrum of the sample,
including an effective modelling of the weakly non-linear contribution and
allowing for an arbitrary photometric redshift smoothing. The inclusion of the
power spectrum data allows for an improved self-calibration of the scaling
relation. We find that the inclusion of the power spectrum typically brings a
per cent improvement in the errors on the fluctuation amplitude
and the matter density . Finally, we apply this
method to constrain models of the early universe through the amount of
primordial non-Gaussianity of the local type, using both the variation in the
halo mass function and the variation in the cluster bias. We find a constraint
on the amount of skewness () from the
cluster data alone.
Comment: 12 pages, 10 figures, 2 tables. Minor changes to match published
version in MNRAS
Anomaly detection for machine learning redshifts applied to SDSS galaxies
We present an analysis of anomaly detection for machine learning redshift
estimation. Anomaly detection allows the removal of poor training examples,
which can adversely influence redshift estimates. Anomalous training examples
may be photometric galaxies with incorrect spectroscopic redshifts, or galaxies
with one or more poorly measured photometric quantities. We select 2.5 million
'clean' SDSS DR12 galaxies with reliable spectroscopic redshifts, and 6730
'anomalous' galaxies with spectroscopic redshift measurements which are flagged
as unreliable. We contaminate the clean base galaxy sample with galaxies with
unreliable redshifts and attempt to recover the contaminating galaxies using
the Elliptical Envelope technique. We then train four machine learning
architectures for redshift analysis on both the contaminated sample and on the
preprocessed 'anomaly-removed' sample and measure redshift statistics on a
clean validation sample generated without any preprocessing. We find an
improvement on all measured statistics of up to 80% when training on the
anomaly removed sample as compared with training on the contaminated sample for
each of the machine learning routines explored. We further describe a method to
estimate the contamination fraction of a base data sample.
Comment: 13 pages, 8 figures, 1 table, minor text updates to match MNRAS
accepted version
Implications of multiple high-redshift galaxy clusters
To date, 14 high-redshift (z>1.0) galaxy clusters with mass measurements have
been observed, spectroscopically confirmed and are reported in the literature.
These objects should be exceedingly rare in the standard LCDM model. We
conservatively approximate the selection functions of these clusters' parent
surveys, and quantify the tension between the abundances of massive clusters as
predicted by the standard LCDM model and the observed ones. We alleviate the
tension considering non-Gaussian primordial perturbations of the local type,
characterized by the parameter fnl and derive constraints on fnl arising from
the mere existence of these clusters. At the 95% confidence level, fnl>467 with
cosmological parameters fixed to their most likely WMAP5 values, or fnl > 123
(at 95% confidence) if we marginalize over WMAP5 parameters priors. In
combination with fnl constraints from Cosmic Microwave Background and halo
bias, this determination implies a scale-dependence of fnl at approx. 3 sigma.
Given the assumptions made in the analysis, we expect any future improvements
to the modeling of the non-Gaussian mass function, survey volumes, or selection
functions to increase the significance of fnl>0 found here. In order to
reconcile these massive, high-z clusters with an fnl=0, their masses would need
to be systematically lowered by 1.5 sigma or the sigma8 parameter should be
approx. 3 sigma higher than CMB (and large-scale structure) constraints. The
existence of these objects is a puzzle: it either represents a challenge to the
LCDM paradigm or it is an indication that the mass estimates of clusters are
dramatically more uncertain than we think.
Comment: 11 pages, 7 figures, modified to match published version
Stacking for machine learning redshifts applied to SDSS galaxies
We present an analysis of a general machine learning technique called
'stacking' for the estimation of photometric redshifts. Stacking techniques can
feed the photometric redshift estimate, as output by a base algorithm, back
into the same algorithm as an additional input feature in a subsequent learning
round. We show how all tested base algorithms benefit from at least one
additional stacking round (or layer). To demonstrate the benefit of stacking,
we apply the method to both unsupervised machine learning techniques based on
self-organising maps (SOMs), and supervised machine learning methods based on
decision trees. We explore a range of stacking architectures, such as the
number of layers and the number of base learners per layer. Finally we explore
the effectiveness of stacking even when using a successful algorithm such as
AdaBoost. We observe a significant improvement of between 1.9% and 21% on all
computed metrics when stacking is applied to weak learners (such as SOMs and
decision trees). When applied to strong learning algorithms (such as AdaBoost)
the ratio of improvement shrinks, but still remains positive and is between
0.4% and 2.5% for the explored metrics and comes at almost no additional
computational cost.
Comment: 13 pages, 3 tables, 7 figures, version accepted by MNRAS, minor text
updates. Results and conclusions unchanged